Synthetic Sport Datasets

The literature review identified three main methodological categories for synthetic data generation in sports science: GAN-based, simulation-based, and statistical models.

These three categories describe the primary routes through which synthetic sports data are produced, each striking a different balance between realism, reproducibility, and computational complexity.
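As a rough illustration of the statistical route only, the sketch below fits an independent Gaussian to each column of a small numeric table and samples new rows from it. The function names, the two columns, and the four measurement rows are invented placeholders, not values from any reviewed dataset, and the independence assumption is a deliberate simplification rather than a method taken from the review.

```python
import random
import statistics


def fit_column_gaussians(rows):
    """Estimate a (mean, sample stdev) pair for each column of a numeric table."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]


# Placeholder measurements (invented): [heart rate in bpm, running speed in km/h]
real_rows = [[142.0, 11.2], [155.0, 12.8], [149.0, 12.1], [160.0, 13.5]]

params = fit_column_gaussians(real_rows)
synthetic_rows = sample_synthetic(params, n=5, seed=42)
```

Sampling each column independently discards correlations between variables, which is one reason such a model is cheap and reproducible but less realistic than a learned generator.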

Ranking of Real Sport Datasets

The ranking comparison shows that GAN-based datasets rely primarily on video and image data focused on athlete performance and activity detection, while Statistical datasets encompass tabular, physiological, and survey-based data. The evaluation criteria highlight the datasets most suitable for each approach, supporting the selection of appropriate data sources for developing synthetic datasets with Statistical or GAN-based methods.

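Category    | Number of Datasets | Data Types                                                    | Population        | Most Frequent Sports                    | Top 3 (by Score)
------------|--------------------|---------------------------------------------------------------|-------------------|-----------------------------------------|------------------------------
GAN-based   | 16                 | Video, Image                                                  | Athlete           | Multiple, Basketball, Fitness           | TeamTrack, C-Sports, SportsMOT
Statistical | 34                 | Tabular, Physiological, Medical Record, Survey, Accelerometer | Athlete, Multiple | Football, Baseball, Basketball, Fitness | MTS-5, NCAA-ISP, LLBD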
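
The "Top 3 (by Score)" column can be read as a group-then-rank step. The sketch below shows one way to do that, assuming each dataset already carries a single numeric score. The dataset names and scores here are hypothetical placeholders, since the actual scores and evaluation criteria are not given in this section.

```python
from collections import defaultdict


def top_n_by_category(records, n=3):
    """Return the n highest-scoring dataset names for each category."""
    grouped = defaultdict(list)
    for category, name, score in records:
        grouped[category].append((name, score))
    return {
        category: [
            name
            for name, _ in sorted(items, key=lambda item: item[1], reverse=True)[:n]
        ]
        for category, items in grouped.items()
    }


# Hypothetical (category, dataset name, score) records; not real evaluation results.
records = [
    ("GAN-based", "dataset_a", 0.91),
    ("GAN-based", "dataset_b", 0.62),
    ("GAN-based", "dataset_c", 0.85),
    ("GAN-based", "dataset_d", 0.77),
    ("Statistical", "dataset_e", 0.70),
    ("Statistical", "dataset_f", 0.88),
]

top3 = top_n_by_category(records)
```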

Use Cases

The treemap presents a structured overview of the sport datasets, separated into Real and Synthetic sources. Each side groups datasets according to the primary analysis category they support.

Each box within a category represents a dataset and contains its key information, including sport, data type, collection method, variables captured, and the specific use case it supports.
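
The hierarchy described above (source, then analysis category, then one box per dataset) maps naturally onto a nested structure. The following is a minimal sketch of that layout. The class name, field names, category labels, and the two example entries are all hypothetical and only mirror the kinds of information the boxes are said to contain.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DatasetBox:
    name: str
    source: str             # "Real" or "Synthetic"
    category: str           # primary analysis category the dataset supports
    sport: str
    data_type: str
    collection_method: str
    variables: list
    use_case: str


def build_treemap(boxes):
    """Group dataset boxes as source -> category -> list of boxes."""
    tree = defaultdict(lambda: defaultdict(list))
    for box in boxes:
        tree[box.source][box.category].append(box)
    # Convert to plain dicts so that later lookups cannot create empty entries.
    return {source: dict(categories) for source, categories in tree.items()}


# Hypothetical placeholder entries, not datasets from the review.
boxes = [
    DatasetBox("example_real", "Real", "category_a", "Basketball", "Video",
               "broadcast footage", ["player position"], "activity detection"),
    DatasetBox("example_synthetic", "Synthetic", "category_a", "Football", "Tabular",
               "simulation", ["speed", "distance"], "performance analysis"),
]

treemap = build_treemap(boxes)
```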